Member

lewtun commented May 5, 2025

This PR bumps lighteval to use pass@1 metrics for all evals, with n > 1 samples per prompt to mitigate variance from repeated runs:

  • AIME24/25: 64 samples per prompt
  • GPQA: 8 samples per prompt
  • MATH_500: 4 samples per prompt

See huggingface/lighteval#698 for more details.
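
To make the metric concrete, here is a minimal sketch of how pass@1 can be estimated from n samples per prompt. It is not lighteval's implementation: it uses the standard unbiased pass@k estimator, and the function names and dictionary keys are hypothetical, chosen only for illustration. For k = 1 the estimator reduces to the fraction of correct samples, so a larger n mainly tightens the estimate rather than changing what is measured.

```python
from math import comb
from statistics import mean

# Samples per prompt as listed in this PR description.
# The dictionary keys are illustrative, not lighteval task names.
SAMPLES_PER_PROMPT = {"aime24": 64, "aime25": 64, "gpqa": 8, "math_500": 4}


def pass_at_k(n: int, c: int, k: int = 1) -> float:
    """Unbiased pass@k estimate for one prompt with n samples, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset of the n samples has no correct
    # answer is C(n - c, k) / C(n, k); pass@k is the complement.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(correct_counts: list[int], n: int) -> float:
    """Average per-prompt pass@1 over a benchmark (hypothetical helper)."""
    return mean(pass_at_k(n, c, k=1) for c in correct_counts)


if __name__ == "__main__":
    # Four made-up prompts, each sampled 4 times as for MATH_500.
    counts = [4, 2, 0, 2]
    print(benchmark_pass_at_1(counts, SAMPLES_PER_PROMPT["math_500"]))
```

With k = 1, `pass_at_k(n, c)` equals `c / n`, so the benchmark score is just the mean per-prompt accuracy across the n samples.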

Note that I've updated the leaderboard to include the n values per benchmark to avoid breaking backwards compatibility:

[Image: Screenshot 2025-05-07 at 10 53 14]

TODO

  • Update README with new eval scores

lewtun changed the title from "[WIP] Use pass@1 for all evals" to "Use pass@1 for all evals" on May 7, 2025
lewtun requested a review from edbeeching on May 7, 2025 08:53
lewtun merged commit c802f00 into main on May 9, 2025
1 check passed
lewtun deleted the pass_at_1_with_large_n branch on May 9, 2025 15:42